Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restart LXD on MicroCloud start to refresh symlinks #136

Merged
merged 3 commits into from
Jul 24, 2023

Conversation

masnax
Copy link
Contributor

@masnax masnax commented Jul 12, 2023

Adds a hook that restarts LXD on uninitialized MicroCloud systems when MicroCloud starts.

I'm not sure if immediately calling GetServer like I'm doing here will cause issues on slower machines, it seems to work fine on my machine.

This still has the caveat of requiring a snap restart microcloud to force LXD to restart. I'm open to also putting this restart on run on an execution of microcloud init although I'm concerned that have some odd effects on slower machines again.

@tomponline
Copy link
Member

Please can you update the PR description to explain what prompted this change, what does it fix? Thanks

@masnax
Copy link
Contributor Author

masnax commented Jul 12, 2023

@tomponline

The main issue is that when MicroCloud directs LXD to set up an OVN network with MicroOVN, LXD uses some symlinks in its snap environment to execute the commands. These symlinks are generated by the LXD snap's daemon start hook, so if LXD happens to start before MicroOVN is installed, LXD won't have the right symlinks, and setting up the network will fail (Resulting in the error in the main post of #107).

To fix this, LXD needs to be restarted when MicroCloud runs, so it has the right symlinks.

The problems here are:

  • MicroCloud actually needs LXD to restart before a call to microcloud init, restarting on daemon start might not be enough if MicroOVN is installed after MicroCloud.
  • LXD restarting may not solve the problem in the microcloud add case because if the local system needs to restart its LXD, then a node in the cluster has to go down. Additionally, the node being added might need to be signalled to restart its LXD as well.

The other solution (as we briefly discussed on IRC) would be to use the proper paths dynamically in LXD. This would solve the problems above, but would add split the central location where we track the paths of services that LXD works with, and integrate it into the code.

@markylaing
Copy link
Contributor

markylaing commented Jul 13, 2023

@tomponline

The main issue is that when MicroCloud directs LXD to set up an OVN network with MicroOVN, LXD uses some symlinks in its snap environment to execute the commands. These symlinks are generated by the LXD snap's daemon start hook, so if LXD happens to start before MicroOVN is installed, LXD won't have the right symlinks, and setting up the network will fail (Resulting in the error in the main post of #107).

To fix this, LXD needs to be restarted when MicroCloud runs, so it has the right symlinks.

The problems here are:

  • MicroCloud actually needs LXD to restart before a call to microcloud init, restarting on daemon start might not be enough if MicroOVN is installed after MicroCloud.
  • LXD restarting may not solve the problem in the microcloud add case because if the local system needs to restart its LXD, then a node in the cluster has to go down. Additionally, the node being added might need to be signalled to restart its LXD as well.

The other solution (as we briefly discussed on IRC) would be to use the proper paths dynamically in LXD. This would solve the problems above, but would add split the central location where we track the paths of services that LXD works with, and integrate it into the code.

I just want to double check I'm understanding this correctly:

  • If microOVN is installed before LXD, LXD can detect that and will generate symlinks that are required to configure OVN? But if it is installed after LXD the symlinks are incorrect and we need to restart LXD to re-generate them?
  • We can't restart LXD during microcloud init. Why might this not be enough?
  • We can't restart LXD during microcloud add. Why do we need to restart the local LXD when adding a new member? Can we not check if we have an OVN service and if so restart LXD in the new member?

Also where are these hooks defined, and what happens if microOVN is installed after the daemon is started?

@markylaing
Copy link
Contributor

Also a side note - can we add an internal endpoint to LXD to ask it to refresh it's symlinks without requiring a restart?

@masnax
Copy link
Contributor Author

masnax commented Jul 13, 2023

I just want to double check I'm understanding this correctly:

  • If microOVN is installed after LXD, the LXD snap (but presumably not LXD itself?) can detect that and will generate symlinks (to what?) that are required to configure OVN?

The LXD snap creates the symlinks only when the LXD snap starts. If microovn is installed after LXD, these symlinks will only be set the next time the LXD snap starts.

  • We can't restart LXD during microcloud init. Why might this not be enough?

This would solve the problem for the init case, but it would cause a consistent delay when a user initially runs microcloud init. Not sure how bad this would be on slower systems.

  • We can't restart LXD during microcloud add. Why do we need to restart the local LXD when adding a new member? Can we not check if we have an OVN service and if so restart LXD in the new member?

So when joining, we perform some actions both on the joining system that sends the join request and the clustered system that eventually handles it (dqlite leader, I believe). I believe currently the symlinks we care about are just for setting up local services for ovn, but technically both the cluster and the joiner might potentially want to use some of these paths.

Also where are these hooks defined, and what happens if microOVN is installed after the daemon is started?

Where it pertains to microovn, these symlinks are mainly for some certs and state information passed to ovn commands. They're set up here and here

If microovn is installed after LXD is started, absolutely nothing happens. LXD would not have set up the symlinks in a way that would work with microovn.

Also a side note - can we add an internal endpoint to LXD to ask it to refresh it's symlinks without requiring a restart?

That endpoint would have to have some way of executing either the daemon.start snap hook or some other snap action that can also be run when the daemon starts. I'm not familiar enough with how snaps work to know if this is possible.

@markylaing
Copy link
Contributor

markylaing commented Jul 14, 2023

  • We can't restart LXD during microcloud init. Why might this not be enough?

This would solve the problem for the init case, but it would cause a consistent delay when a user initially runs microcloud init. Not sure how bad this would be on slower systems.

I don't think it's too much of a concern to have a delay during microcloud init. The init command is interactive and takes a while anyway. We can indicate to the user what we are doing.

  • We can't restart LXD during microcloud add. Why do we need to restart the local LXD when adding a new member? Can we not check if we have an OVN service and if so restart LXD in the new member?

So when joining, we perform some actions both on the joining system that sends the join request and the clustered system that eventually handles it (dqlite leader, I believe). I believe currently the symlinks we care about are just for setting up local services for ovn, but technically both the cluster and the joiner might potentially want to use some of these paths.

I'm still confused about this, why would the existing member use symlinks on another machine?

Also a side note - can we add an internal endpoint to LXD to ask it to refresh it's symlinks without requiring a restart?

That endpoint would have to have some way of executing either the daemon.start snap hook or some other snap action that can also be run when the daemon starts. I'm not familiar enough with how snaps work to know if this is possible.

Presumably the daemon.start snap hook is executed in the same environment as LXD itself, otherwise it would be a security concern. Is that the case @simondeziel @tomponline?

If this is true then the symlinks created in the daemon.start hook could be created on the fly by LXD itself. We could add a /internal/detect-integrations (or similar) endpoint to LXD and call it from microcloud on all nodes during init, and on the joining node during add. Then there would be no reason to restart LXD.

Edit: Also since it would always be called by microcloud, we can remove it from LXDs daemon.start hook.

@masnax
Copy link
Contributor Author

masnax commented Jul 14, 2023

I'm still confused about this, why would the existing member use symlinks on another machine?

Not from another machine. It's a hypothetical edge case: A node with a symlink to something that's utilized once the node is already clustered, and is handling a join request. This node would've joined the cluster with incorrectly set up symlinks (because LXD started before the relevant service), and never happened to restart. It wouldn't have errored out when it joined because the hypothetical command is run on existing cluster members, so the invalid state on the joiner is irrelevant. Once the node is clustered, its state would be silently invalid until this node becomes dqlite leader and is forwarded a request to add a new cluster member.

If this is true then the symlinks created in the daemon.start hook could be created on the fly by LXD itself. We could add a /internal/detect-integrations (or similar) endpoint to LXD and call it from microcloud on all nodes during init, and on the joining node during add. Then there would be no reason to restart LXD.

Right, personally I think this is the most foolproof solution, but involves migrating all of this state logic over to go, unless there's some way to leave it in the snap configuration and have the endpoint execute a snap action which runs exactly what's currently in daemon.start.

@tomponline
Copy link
Member

I think this gets to the heart of the issue. Should LXD (beyond its own packaging) have knowledge of and interact with (even just passively) MicroCloud, or should we treat it as just another external client.

I would strongly caution against using /internal from anything except lxd itself, as otherwise it fundamentally changes what the /internal endpoints are for.

I think we should discuss this in our next microcloud meeting so we can all get on the same page, and consider both microovn and microceph at the same time to see if we can come up with a general solution.

@simondeziel
Copy link
Member

Those snap to snap interactions should probably be best discussed with the snap people (https://chat.canonical.com/canonical/channels/starcraft). It sounds to me that we would need some more interfaces automatically connected to let them interact.

@tomponline
Copy link
Member

Lets chat about it in our next meeting as it maybe we can avoid needing to do it in the snap at all.

Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
@masnax
Copy link
Contributor Author

masnax commented Jul 24, 2023

As discussed in our meeting, I've updated this PR to restart LXD in more places:

  1. When MicroCloud starts
  2. When microcloud init is called
  3. When a joining system receives its join token from the system that called microcloud init

This should suffice and avoid the nasty MicroOVN error until we can discuss a more permanent and resilient solution that doesn't rely on restarting LXD completely.

Sets LXD to restart at three points:
  1) When MicroCloud first starts (to avoid a potential lag on all
     peers when fetching resources as LXD sets up for the first time)
  2) When the user invokes `microcloud init`, to ensure LXD is set up
     properly with the correct symlinks for microcloud
  3) When a MicroCloud daemon receives a join request for LXD, to
     ensure the above for all joining nodes.

Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
@tomponline tomponline merged commit 8dff5af into canonical:main Jul 24, 2023
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants